Chainsaw: protein domain segmentation with fully convolutional neural networks

Abstract
Motivation: Protein domains are fundamental units of protein structure and play a pivotal role in understanding folding, function, evolution, and design. The advent of accurate structure prediction techniques has resulted in an influx of new structural data, making the partitioning of these structures into domains essential for inferring evolutionary relationships and functional classification.
Results: This article presents Chainsaw, a supervised learning approach to domain parsing that achieves accuracy that surpasses current state-of-the-art methods. Chainsaw uses a fully convolutional neural network which is trained to predict the probability that each pair of residues is in the same domain. Domain predictions are then derived from these pairwise predictions using an algorithm that searches for the most likely assignment of residues to domains given the set of pairwise co-membership probabilities. Chainsaw matches CATH domain annotations in 78% of protein domains versus 72% for the next closest method. When predicting on AlphaFold models, expert human evaluators were twice as likely to prefer Chainsaw's predictions versus the next best method.
Availability and implementation: github.com/JudeWells/Chainsaw.


Introduction
Protein domains are generally defined as self-stabilizing units composed of several secondary structural elements that pack together to form a hydrophobic core. From an evolutionary perspective, the protein domain is the level at which homology and functional groups are understood. Structural protein domain databases such as CATH (Orengo et al. 1997), SCOP (Murzin et al. 1995), SCOPe (Chandonia et al. 2021), and ECOD (Cheng et al. 2014) are essential for advancing scientific understanding of the protein universe. In 2022, DeepMind released the AlphaFold models for over 200 million proteins, increasing the number of available structures by multiple orders of magnitude (Varadi et al. 2022). These databases present opportunities to discover novel domains, infer evolutionary links, and generate functional hypotheses, but an essential first step towards these goals is to parse these 200 million structures into constituent domains with high accuracy.
Existing protein domain boundary prediction techniques can be broadly classified into two categories: sequence-based approaches and structure-based approaches. As expected, approaches utilizing structural input outperform those relying solely on sequence information (Shi et al. 2019). With the advent of high-quality predicted structures from AlphaFold2 (Jumper et al. 2021), obtaining a 3D structure is no longer a significant constraint. To integrate the 200 million AlphaFold models into protein domain databases, it is logical to exploit the predicted structure as an input for enhancing domain boundary prediction. Historically, most structure-based approaches have used unsupervised, heuristic algorithms applied to contact maps or pairwise residue distances. These approaches are grounded in the physical intuition that the density of contacts is higher within domains than between domains (Holm and Sander 1994, Alexandrov and Shindyalov 2003, Postic et al. 2017, Zheng et al. 2020, Cretin et al. 2022, Zhu et al. 2023). Although unsupervised methods can be effective, it is challenging to hand-design a heuristic that encompasses all cases. Other methods (Redfern et al. 2007, Zhang et al. 2023) augment the unsupervised approach with the ability to match against a library of known domains, using sequence or structure comparisons. For example, DPAM (domain parser for AlphaFold models) (Zhang et al. 2023) uses a fixed formula to assess prospective splits as a function of three inputs: inter-residue distance, AlphaFold's predicted aligned error, and predicted domain co-membership via matching against a library of known domains. Comparison-based methods are well suited to segmenting proteins containing only known domains but may underperform on proteins containing domains which are not easily recognized by comparison tools or which are not included in existing databases.
The growth of protein structure databases complete with domain annotations presents an opportunity to instead recast the domain segmentation problem as a supervised learning task. Deep learning models have the potential to capture complex structural relationships and exploit these to achieve higher accuracy than heuristic unsupervised methods. Previously proposed supervised domain segmentation methods have mostly relied on sequence inputs and consequently struggled to match the performance of unsupervised methods which segment a known or predicted structure directly (Shi et al. 2019). Other supervised approaches have relied on a per-residue boundary classification approach (Jiang et al. 2018, Shi et al. 2019, Mahmud et al. 2022). Somewhat similar to our proposed approach, Eguchi and Huang (2020) train a convolutional neural network (CNN), herein EguchiCNN, to do image segmentation on protein structures represented by 2D distance maps. The EguchiCNN architecture was primarily designed for the more specific task of classifying domain regions into one of the 38 CATH architectures. However, part of the pipeline includes a domain segmentation predictor which uses the same CNN architecture as their architecture classification model. This approach treats the domain segmentation problem as a multiclass classification of residues where each domain constitutes a separate class. Limitations of this approach are that it can only handle proteins up to 512 residues in length and it can only detect a maximum of eight domains. A recent supervised approach called Merizo (Lau et al. 2023) uses a transformer architecture with invariant point attention to directly cluster residues into domains based on both sequence and structure inputs. This method was shown to perform better than UniDoc (Zhu et al. 2023), SWORD (Postic et al. 2017), DeepDom (Jiang et al. 2018), and EguchiCNN (Eguchi and Huang 2020). However, Merizo was not trained on any single-domain CATH proteins, and we find that it consequently tends to over-split single-domain proteins (Section 3).
In this work, we introduce Chainsaw, a supervised learning approach to protein domain segmentation. Instead of predicting domain boundaries directly or considering each domain as a separate class, Chainsaw relies on a 2D CNN trained to estimate the probability that pairs of residues belong in the same domain (Fig. 1). Domain boundaries are derived from these pairwise co-membership probabilities using a greedy algorithm that searches for the most likely assignment of residues to domains given the predicted probabilities (Section 2.4). Formulating the supervised learning problem as a classification task at the level of pairs of residues rather than as a boundary prediction task at the level of individual residues has three notable advantages. First, it makes the prediction of discontinuous domains more straightforward. Second, it sets no limit on the number of domains that can be predicted. Finally, it improves the class imbalance problems associated with residue classification. Unlike methods such as EguchiCNN, Chainsaw can handle inputs of any size without cropping or padding. We show that Chainsaw achieves better domain parsing accuracy when compared with supervised methods Merizo and EguchiCNN as well as other leading unsupervised structure-based domain parsers (UniDoc, PUU, and SWORD) on held-out test sets of domain annotations from CATH and the critical assessment of structure prediction (CASP) competitions. We further evaluate Chainsaw on a random sample of AlphaFold models and find fewer domain prediction errors than the next best method. In a blind side-by-side human comparison of 200 AlphaFold models, we find the Chainsaw domain parsing to be preferable to UniDoc in roughly twice as many cases. Finally, we show that Chainsaw combined with Foldseek can be used to infer functional annotations in previously uncharacterized proteins.

Datasets
Following a similar approach to Merizo (Lau et al. 2023), we generated a train-validation-test split on CATH-annotated PDB files, ensuring that no CATH superfamily is represented in more than one of the splits. We only include PDBs composed of domains belonging to CATH classes 1, 2, and 3. We note that the CATH superfamily is one level stricter than the S35 (35% sequence identity) clusters within the CATH hierarchy (CATH Database Team 2023). We opted not to use the same splits as Merizo, as these splits only contained multi-domain chains and we know that approximately 45% of chains in the AlphaFold Database have one or zero domains (Lau et al. 2024). After splitting the CATH superfamily codes into train, test, and validation, we represent each PDB chain by a tuple which contains all of the CATH S35 codes of its constituent domains. For each unique tuple of CATH S35 codes, we take one PDB chain to represent that cluster in the validation and test sets. PDB chains which contained irregular amino acids or were missing α-carbon atoms were removed. Two additional chains, 2v495 and 3vkgB, were removed from the test set due to incorrect and incomplete CATH annotations. For the training data, we take one representative PDB file for each tuple of CATH S95 (95% sequence identity) clusters. To account for the additional training data redundancy introduced by this choice, during training we sample chains in a redundancy-aware manner, making use of CATH's sequence-identity-based clustering of domains. As such, one epoch is defined as one pass through all of the S60 sequence clusters (clustered at 60% sequence identity). We additionally experiment with varying the relative frequency with which single- and multi-domain chains are sampled as training data points. Our final approach sampled multi-domain proteins with probability 0.65 and single-domain proteins with probability 0.35. Hyperparameter selection was based on performance on the validation set; final performance is reported on the non-redundant test set. To show that our method is not overfitting to CATH assignments and to test that the approach generalizes to alternative domain assignments, we evaluate an additional test set using the domain assignments from CASP6 (Tai et al. 2005) (Supplementary Material).
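The sampling scheme described above can be illustrated with a short sketch. It is only one plausible reading: the text does not spell out how the 0.65/0.35 split is combined with cluster-level sampling, so here a class (multi- or single-domain) is drawn first, then a cluster uniformly, then a chain within that cluster. The function name `sample_chain` and the dictionary layout are hypothetical.

```python
import random

def sample_chain(multi_clusters, single_clusters, p_multi=0.65, rng=random):
    """Draw one training chain in a redundancy-aware way (illustrative only).

    multi_clusters / single_clusters: dict mapping a sequence-cluster id to
    the list of chain ids in that cluster (hypothetical layout).
    """
    # Choose multi-domain with probability p_multi, single-domain otherwise.
    pool = multi_clusters if rng.random() < p_multi else single_clusters
    # Sample uniformly over clusters, not chains, so that large clusters of
    # near-identical sequences are not over-represented.
    cluster_id = rng.choice(sorted(pool))
    # Then pick one chain uniformly within the chosen cluster.
    return rng.choice(pool[cluster_id])
```

Sampling clusters before chains is what makes the scheme redundancy-aware: a cluster with hundreds of members contributes no more often than a singleton cluster.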

Input features
The Chainsaw neural network takes as input a 3D structural representation of a protein, such as a PDB file. From the 3D structure, we generate five feature channels (Fig. 3) comprising a residue pairwise distance matrix and four channels representing predicted secondary structure computed using STRIDE (Heinig and Frishman 2004). The pairwise residue distance matrix is an L × L matrix D where element d_ij is the distance, in angstroms, between the α-carbon atoms of residues i and j. The predicted secondary structure is represented in two formats. The first is a co-membership matrix C where element c_ij is 1 if residues i and j are in the same secondary structure component, and 0 otherwise. The second format indicates which residues occur at the start and end of secondary structure components, with the first residue of a secondary structure component indicated with 1 and the last residue with −1. Each of the secondary structure representations is instantiated independently for helices and strands, resulting in four secondary structure feature channels in total.
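As a rough illustration of the five channels, the following sketch builds them from α-carbon coordinates and lists of secondary structure segments. The function names are hypothetical, and the segments are assumed to have been parsed from STRIDE output already. The text does not say how the per-residue start/end vector is embedded in an L × L channel; placing it on the diagonal is an assumption made here for simplicity.

```python
import numpy as np

def pairwise_distances(ca_coords):
    """L x L matrix of distances (in angstroms) between alpha-carbon atoms."""
    diff = ca_coords[:, None, :] - ca_coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def ss_channels(segments, length):
    """Two channels for one secondary structure type (helix or strand).

    segments: (start, end) residue index pairs, 0-based and inclusive.
    """
    comembership = np.zeros((length, length))
    start_end = np.zeros(length)
    for start, end in segments:
        comembership[start:end + 1, start:end + 1] = 1.0  # same element
        start_end[start] = 1.0                            # first residue
        start_end[end] = -1.0                             # last residue
    # Assumption: the per-residue vector is placed on the diagonal.
    return comembership, np.diag(start_end)

def build_features(ca_coords, helices, strands):
    """Stack the five channels into an array of shape (5, L, L)."""
    ca_coords = np.asarray(ca_coords, dtype=float)
    length = len(ca_coords)
    channels = [pairwise_distances(ca_coords)]
    for segments in (helices, strands):
        channels.extend(ss_channels(segments, length))
    return np.stack(channels)
```

The channel order here is distance, helix co-membership, helix start/end, strand co-membership, strand start/end; the order used by the authors is not stated.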

Network architecture and training objective
We formulate the supervised learning problem as a 2D to 2D task: transforming the 2D input features into a pairwise probability matrix which expresses the probability that pairs of residues are in the same domain. As such, a fully convolutional architecture with skip connections is a natural choice. We use a modified version of the trRosetta architecture (Yang et al. 2020), a model originally developed for protein structure prediction. trRosetta is a residual network whose blocks employ convolutional layers with progressively increasing dilation rates to achieve a wide receptive field (Fig. 2). We enforce a symmetric-output constraint by adding the transpose of the final layer to itself to give the final symmetrized Â. We truncate the trRosetta architecture to 31 blocks and reduce the number of filters from 64 to 32 but otherwise follow the original model in all details (Yang et al. 2020). The learning objective is to minimize the binary cross-entropy between the predicted residue pairwise domain co-membership matrix (which we call the soft adjacency matrix) and the true adjacency matrix. One advantage of this pairwise representation in both inputs and outputs lies in its invariance to rotations, translations, and reflections of the 3D structure (that is, invariance under the Euclidean group E(3), which contains SE(3)): the representation remains unaltered under any such transformation of the original coordinate space. An important theoretical motivation for using this representation is that a configuration of points in 3D space is fully specified, up to such transformations, by the 2D pairwise distance representation (see theorem 1 in Supplementary Material). To put this another way, the 2D representation captures all relevant spatial geometry while ignoring nuisance factors such as the arbitrary selection of the coordinate system that is used to represent the 3D structure.
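The symmetrization step and the training objective can be sketched in a few lines of NumPy. This is not the authors' training code: the helper names are hypothetical, and applying the symmetrization to pre-sigmoid outputs is an assumption (the text says only that the transpose of the final layer is added to itself).

```python
import numpy as np

def symmetrize(final_layer):
    """Enforce a symmetric output by adding the transpose to the matrix."""
    return final_layer + final_layer.T

def sigmoid(x):
    """Map real-valued outputs to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def binary_cross_entropy(predicted, target, eps=1e-7):
    """Mean binary cross-entropy between soft and true adjacency matrices."""
    p = np.clip(predicted, eps, 1 - eps)  # avoid log(0)
    return float(-np.mean(target * np.log(p) + (1 - target) * np.log(1 - p)))
```

Because the output is symmetric by construction, the network never has to learn that the probability for the pair (i, j) equals that for (j, i).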

Domain assignment algorithm
The output of the neural network is an L × L matrix Â, with entries â_ij representing the probability that residue i and residue j belong to the same domain. A further processing step is required to resolve the uncertainty in the neural network's prediction to derive a final domain assignment. Let A_D be the binary matrix associated with a given domain assignment D. To identify the most likely domain assignment D* given a predicted set of co-membership probabilities Â, we seek to find the D with maximum probability under Â,

$$D^{*} = \arg\max_{D} \; p_{\hat{A}}(A_D) \qquad (1)$$

The probability of a domain assignment given the predicted values Â is given by the product of the entries

$$p_{\hat{A}}(A_D) = \prod_{i,j} \hat{a}_{ij}^{\,a_{ij}} \left(1 - \hat{a}_{ij}\right)^{1 - a_{ij}} \qquad (2)$$

where each â_ij represents the probability that a_ij = 1. The representation A_D has useful properties as a label for supervising the neural network (Section 3.1) (Fig. 3); however, at the stage where we search for the optimal assignment it is preferable to work with a low-rank factorization of A_D, where each residue's domain assignment is represented as a one-hot encoded vector.

To perform the maximization, we use a greedy algorithm, which exploits the fact that A_D has a low-rank structure induced by the domain assignment D. Let v_1, ..., v_K be a sequence of binary indicator vectors for each of the K (predicted) domains, and let V_D be the matrix whose columns are the v_k; hence V_D is a binary matrix with dimensions L × K. The requirement that no residue can be assigned to more than one domain ensures that the rows of this matrix are K-dimensional one-hot vectors indicating domain assignments for each residue. Given this matrix of domain assignments, V_D, the elements a_ij of the adjacency matrix A_D are generated as

$$a_{ij} = \sum_{k=1}^{K} v_{ik} v_{jk} = \left(V_D V_D^{T}\right)_{ij} \qquad (3)$$

where v_ik denotes the ith entry of v_k. Thus our maximization problem becomes to find a set of K vectors, such that the probability of the domain assignment induced by the vectors is maximized,

$$\max_{K,\; v_1, \ldots, v_K} \; p_{\hat{A}}\left(V_D V_D^{T}\right) \qquad (4)$$

where the optimal number K of domains is itself unknown a priori and is therefore determined jointly with the domain assignments. The overall DomainAssigner(Â) procedure is defined in Algorithm 1. Note that the overall computation time is kept low by maximizing the log probability, which decomposes into a sum of log probabilities over individual entries in the adjacency matrix. This means that the change in probability from reassigning a single residue j to domain k, giving a matrix V′ that differs from V_D only in row j, can be computed in time that is linear in the number of residues.

Our domain assignment algorithm proceeds as follows. First, V_D is initialized as an L × K_init matrix of zeros, where K_init is an initialization for the number of domains (K_init = 4). For each residue in turn, we score each of the K possible domain assignments using the score p_Â(V_D V_D^T), and assign the residue to the domain which produces the maximal score, as long as the maximal score is greater than the score under the current assignment. In the case of ties, the first domain is selected. After a complete pass through the sequence, the process is iterated a number of times to allow for corrections in initial assignments. As soon as some residue is assigned to the Kth domain, we add an extra column of zeros, corresponding to an 'overflow' domain which can be subsequently assigned to. This allows the algorithm to predict an arbitrary number of domains. The procedure is summarized in Algorithm 1.

We note that the algorithm is incentivised to predict the correct number of domains: given a perfect predictor of A, then for any ground truth adjacency matrix A_D, all assignments other than the correct assignment are guaranteed to receive lower scores, including assignments which over- or under-predict the number of domains. Indeed, for a perfect predictor, our algorithm is guaranteed to recover the correct domain assignment. Not all residues have to be assigned to a domain. The algorithm only makes an assignment if the score of the best possible domain assignment is greater than the score under a null assignment, in which the residue is not assigned to any domain. Since each residue is initialized with a null assignment (corresponding to a row of zeros in V), residues for which no better assignment is found will maintain a null assignment throughout the procedure.
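The greedy procedure can be sketched as follows. This is a minimal illustration written from the description above, not a transcription of Algorithm 1: the function name `assign_domains`, the number of passes, and the clipping constant are assumptions. Scores are expressed as changes in log-likelihood relative to the null assignment, which only requires the row of log-odds for the residue being updated and is therefore linear in the number of residues.

```python
import numpy as np

def assign_domains(a_hat, k_init=4, n_passes=3, eps=1e-7):
    """Greedy search for a high-likelihood domain assignment.

    Returns one integer label per residue; -1 denotes the null assignment.
    """
    p = np.clip(a_hat, eps, 1 - eps)
    gain = np.log(p) - np.log(1 - p)   # log-odds of co-membership per entry
    pair_gain = gain + gain.T          # entries (i, j) and (j, i) together
    n_res = p.shape[0]
    v = np.zeros((n_res, k_init))      # one-hot rows; an all-zero row is null
    for _ in range(n_passes):
        for j in range(n_res):
            current = v[j].copy()
            v[j] = 0                   # score candidates against other residues
            # Change in log-likelihood, relative to the null assignment,
            # from placing residue j in each candidate domain.
            scores = pair_gain[j] @ v + gain[j, j]
            current_score = scores[current.argmax()] if current.any() else 0.0
            best = int(np.argmax(scores))  # ties resolve to the first domain
            if scores[best] > current_score:
                v[j, best] = 1
                if best == v.shape[1] - 1:  # last column used: add overflow
                    v = np.hstack([v, np.zeros((n_res, 1))])
            else:
                v[j] = current         # keep the existing (or null) assignment
    return np.where(v.any(axis=1), v.argmax(axis=1), -1)
```

For a block-structured Â with high within-domain and low between-domain probabilities, the first pass already places each residue with its domain partners, and later passes leave the labels unchanged. Domain indices follow the order in which domains are first encountered along the sequence.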

Uncertainty quantification
A natural approach to uncertainty quantification is to consider the output of the neural network Â as an L × L multivariate Bernoulli distribution. Then we can consider the output of the final assignment A′ as an observation from the Â distribution and calculate the likelihood (normalized by the number of residues). Figure 4 compares instances where the model has high confidence (Fig. 4a) with other cases where the model confidence is lower, suggesting that alternative domain assignments may be valid (Fig. 4b).
We observe that Chainsaw's confidence score has a good correlation with the ground truth accuracy. The confidence score achieves a Spearman's correlation score of 0.68 with the IoU score when measured on the Chainsaw CATH test set. On this test set, the confidence score ranged from 0.51 to 1.0. Using a confidence score cutoff of 0.85 will increase the probability that a domain is predicted correctly (IoU score ≥ 0.8) from 0.78 to 0.9 at the expense of introducing a 5% chance that a correctly predicted domain is discarded.
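A confidence score of this kind might be computed as in the sketch below. The exact normalization is not given in the text; this version takes the geometric mean of the per-entry Bernoulli likelihoods over all L × L entries, which is an assumption, and the function name `confidence` is hypothetical.

```python
import numpy as np

def confidence(a_hat, a_assigned, eps=1e-7):
    """Normalized likelihood of the assigned adjacency matrix under a_hat.

    Assumption: normalization is a geometric mean over all matrix entries.
    """
    p = np.clip(a_hat, eps, 1 - eps)  # avoid log(0)
    # Per-entry Bernoulli log-likelihood of the binary assignment.
    log_lik = np.where(a_assigned == 1, np.log(p), np.log(1 - p))
    return float(np.exp(log_lik.mean()))
```

Under this reading the score equals 1 when Â matches the assignment exactly and falls towards 0.5 as the predicted probabilities become uninformative.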

Protein domain segmentation with fully convolutional neural networks
The curated domain annotations in databases such as CATH provide a rich source of signal for training deep learning methods to segment protein structures into their constituent domains. We therefore first considered how to exploit such annotations as training data by formulating domain segmentation as a supervised learning problem. In particular, we sought to avoid some of the problems with training a network to solve a residue-level binary classification task of identifying residues at the boundary between domains. These problems include the handling of discontinuous domains, severe class imbalance of boundary to non-boundary residues, and sensitivity to small changes in the boundary. The latter is a problem given that in many cases, multiple adjacent residues could equally be considered to be the 'correct' boundary. In addition, for training a neural network the domain label target representation should have some desirable properties, such as being invariant to indexing or ordering of the domains, and the dimensionality of the label should not depend on the number of domains. Our solution is to define a classification task over pairs of residues, from which domain assignments for individual residues can be recovered. To derive this pairwise task, we start by observing that the domain assignments of a protein of length L can be uniquely represented by an L × L matrix A, whose entries a_ij are 1 if residues i and j are in the same domain and 0 otherwise. For domains with residues that are continuous in sequence, this will result in an adjacency matrix that takes the form of a block-diagonal matrix (see Fig. 5a). Domains which are discontinuous in sequence result in blocks occurring off the diagonal (see Fig. 5b). Given the matrix A, residue-level domain assignments can be recovered by interpreting the matrix A as the adjacency matrix of a graph and partitioning it into its K connected components, where K is the number of domains. Importantly, this procedure works identically for domains that are continuous or discontinuous in sequence. Therefore, supposing we were able to produce a perfect predictor of the matrix A of pairwise domain co-membership between residues in a given structure, we could then unambiguously read off residue-level domain assignments for residues in either continuous or discontinuous domains. As a training label for a neural network, this representation has the advantage of being permutation invariant with respect to domain labels (there is no indexing suggesting an ordering of the domains), and the dimensionality is determined by the number of residues alone rather than being dependent on the number of domains.
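The correspondence between domain labels and the matrix A can be made concrete with a small sketch. The function names are hypothetical, and for simplicity every residue is assumed to belong to some domain.

```python
import numpy as np

def labels_to_adjacency(labels):
    """Binary L x L matrix with a_ij = 1 when residues i and j share a domain."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(int)

def adjacency_to_domains(adjacency):
    """Recover domains as the connected components of the graph defined by A."""
    n_res = len(adjacency)
    seen, domains = set(), []
    for start in range(n_res):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:                      # depth-first traversal
            i = stack.pop()
            component.append(i)
            for j in range(n_res):
                if adjacency[i][j] and j not in seen:
                    seen.add(j)
                    stack.append(j)
        domains.append(sorted(component))
    return domains
```

For example, the labels `[0, 0, 1, 1, 0]` describe a discontinuous first domain made of residues 0, 1, and 4. Relabelling the domains as `[1, 1, 0, 0, 1]` yields exactly the same matrix, which is the permutation invariance noted above.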

Chainsaw
The main component of Chainsaw is a CNN (Fig. 2) that is trained to predict the matrix A for a given input structure, by predicting whether each pair of residues belongs in the same domain. The network takes as input a set of pairwise representations derived from the protein's 3D structure, consisting of the pairwise α-carbon residue distances and predicted secondary structure segments (Section 2.2), and outputs estimated probabilities that pairs of residues belong in the same domain. The final domain assignments are then determined by a search algorithm that maximizes the likelihood of the assignments under the predicted pairwise co-membership probabilities (Section 2.4). As training data, we obtained a set of PDB structures with associated domain annotations from CATH, which were converted into the corresponding matrices A to serve as targets in a binary classification task for residue pairs (Section 2).

Chainsaw performance on experimental PDB structures
We benchmarked Chainsaw against three unsupervised methods, UniDoc, PUU, and SWORD, and two supervised methods, Merizo and EguchiCNN. Both Merizo and UniDoc were previously shown in Lau et al. (2023) to significantly outperform previous supervised learning methods EguchiCNN (Eguchi and Huang 2020) and DeepDom (Jiang et al. 2018). It should be noted that EguchiCNN was trained for the more specific task of classifying domains into CATH architectures, with generic segmentation as a pre-processing step, and DeepDom is a structure-free predictor. We constructed a benchmark dataset of 1365 protein chains from the PDB using CATH domain annotations as ground truth. Our test dataset was constructed to ensure an equal balance of single- and multi-domain proteins; this is motivated by findings in our other work (Lau et al. 2024), where we find that applying structure-based domain parsers to the AFDB suggests the following proportions: 3% zero-domain, 42% single-domain, and 55% multi-domain. Using our test dataset, we first compare Chainsaw against two unsupervised methods, UniDoc and PUU, and one supervised method, EguchiCNN (Fig. 6a). We note that EguchiCNN may have been trained on some proteins from our test set, which could artificially inflate its performance. We cannot present results for SWORD on this benchmark as it failed to output results on a significant proportion of the test set. We assess domain prediction accuracy via three metrics: the intersection-over-union (IoU) (Lau et al. 2023), the proportion of correctly parsed domains (domain-level IoU ≥ 0.8), and the domain boundary distance score (Tress et al. 2007). Calculation details for all metrics are provided in the Supplementary Material. Chainsaw achieves an average IoU score of 0.88 versus 0.84 for the next-best competitor method, UniDoc (Fig. 6a). If we restrict our attention to multi-domain proteins only, the performance gap increases, with Chainsaw scoring 0.83 versus 0.76 for UniDoc. To compare against Merizo, we create a subset of the test data for which there is no domain in a CATH superfamily that also occurs in the Merizo training data (n = 208). On this subset, Chainsaw achieves an average IoU of 0.91 versus 0.83 for Merizo (Fig. 6b). On this dataset, we find Merizo over-splits single-domain structures in 28% of cases versus 10% for Chainsaw, plausibly because Merizo was trained solely on multi-domain proteins in CATH. We note that if we only compare methods on multi-domain proteins, the performance of Chainsaw and Merizo is the same, though the small sample size (n = 52) limits the power of our analysis. To further test whether the performance differences between Chainsaw and Merizo can be attributed entirely to differences in the training data, we separately trained a Chainsaw model from scratch using the Merizo training data and evaluated it on the Merizo test data (which only contains multi-domain proteins). Following this approach, we observe that Chainsaw still outperforms Merizo, suggesting that the differences are not solely driven by the difference in training data (Fig. 6c). As a final test, to see how Chainsaw performs on domain annotations which are not from CATH, we also evaluated a model on domain annotations from CASP6 (n = 63) and find that the Chainsaw model still outperforms UniDoc, PUU, SWORD2, and Merizo (see Supplementary Material).
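To make the "correctly parsed domain" criterion concrete, the sketch below computes a domain-level IoU and the fraction of true domains matched at a threshold of 0.8. The authors' exact definitions are in the Supplementary Material, which is not reproduced here, so the matching rule (each true domain is compared with its best-overlapping predicted domain) and the function names are assumptions.

```python
def domain_iou(true_domain, predicted_domain):
    """Intersection-over-union of two domains given as residue index lists."""
    true_set, pred_set = set(true_domain), set(predicted_domain)
    union = true_set | pred_set
    return len(true_set & pred_set) / len(union) if union else 0.0

def fraction_correct(true_domains, predicted_domains, threshold=0.8):
    """Fraction of true domains matched by some prediction at the threshold."""
    if not true_domains:
        return 0.0
    hits = 0
    for true_domain in true_domains:
        # Assumption: score each true domain against its best-matching
        # predicted domain.
        best = max((domain_iou(true_domain, p) for p in predicted_domains),
                   default=0.0)
        hits += best >= threshold
    return hits / len(true_domains)
```

With this rule, a boundary that is off by one residue in a 20-residue, two-domain chain still counts both domains as correct, whereas merging the two domains into one counts neither.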

Chainsaw performance on AlphaFold models
We sought to evaluate Chainsaw's performance on predicted structures in the AlphaFold database (AFDB). This is challenging because we lack a ground-truth domain assignment on AlphaFold models. One approach is to map CATH annotations from PDBs to AlphaFold models with matching sequences. Following this approach, we observed no significant change in the performance of Chainsaw when predicting on AlphaFold models as opposed to their corresponding PDB files (Supplementary Material). However, this evaluation approach only considers AFDB structures that have counterparts in the PDB and that are therefore typically well-modelled. In general, AFDB structures are notably different from experimentally resolved structures in the PDB. The most significant difference concerning domain parsing is the presence of long segments of residues with no apparent secondary structure (see, e.g. Fig. 7b and h). To evaluate performance on AlphaFold models, a random sample of 200 human protein structures was taken from the AlphaFold database. A naive sampling would have resulted in relatively few large proteins, so to mitigate this, we sampled structures equally from binned protein lengths.
Faults considered were: under-splitting (see Fig. 7a-d), over-splitting (see Fig. 7e, f, i), incorrect boundaries (see Fig. 7g, c), missing domains, and falsely identifying domains (see Fig. 7h). UniDoc cannot predict non-domain residues. This would result in very poor performance when run on AlphaFold models, which contain many poorly modelled segments with no predicted secondary structure. To improve UniDoc's predictions, residues with low predicted confidence by AlphaFold (predicted Local Distance Difference Test, pLDDT, < 70) were removed from the domain predictions. In our evaluation protocol, evaluators compare the number and severity of segmentation faults to judge which segmentation is preferable. We also allow the option to judge the domain parsings to be of equal quality. We find that the Chainsaw domain parsing was preferable in 24% of cases, UniDoc was preferable in 11% of cases, and both parsings were of equal quality in 65% of cases (Fig. 6d). Figure 7 showcases a selection of [...]
[...] value < 1e-10) to an existing CATH S60 domain. This amounts to 2567 predicted domains where we can infer homology to existing domains. Following the same procedure for UniDoc, we note that the UniDoc + Foldseek pipeline matches a similar number of predicted domains (2451); however, 28% of Chainsaw + Foldseek domain matches are not discovered using the UniDoc + Foldseek pipeline and 24% of UniDoc + Foldseek domain matches are not discovered by the Chainsaw + Foldseek pipeline. A 'domain match' is defined here as identifying the same CATH S60 representative domain in a given AlphaFold model. This suggests that Chainsaw and UniDoc are complementary methods which can be combined to recover more correct domain parsings than either method individually. We note, however, that the overlap between methods would likely be greater if we matched against CATH domains clustered at 35% sequence identity or considered the top-k Foldseek matches instead of only the top one.
For the final analysis, we show that the Chainsaw + Foldseek pipeline can infer functional annotations for L. infantum proteins that are uncharacterized. We start by considering the subset of the L. infantum proteome where UniProt lists the protein as 'uncharacterized' and there are no Gene Ontology (GO) annotations (Ashburner et al. 2000) associated with the protein. This amounts to 3280 out of 7924 L. infantum proteins. Of these, we find 413 proteins which have a Chainsaw-predicted domain which matches against a CATH S60 representative domain with a Foldseek e-value of less than 1e-5. Of these matched proteins, 396 are matched by Chainsaw to a CATH S60 domain with a functional annotation from Pfam (Mistry et al. 2021) or GO. Four of these are shown as case studies in Fig. 9. We note that this technique can detect structural homology in multi-domain proteins where the sequence identity is frequently less than 15%, and for each of the showcased examples (Fig. 9), we checked that there were no Pfam, CATH FunFam, or Gene3D matches returned when running the sequences through InterProScan.

Conclusion
This study presents Chainsaw, a supervised learning approach to protein domain prediction. We leverage a residual CNN to estimate the probability that pairs of residues are in the same domain and combine this with an algorithm for converting the pairwise probabilities into domain assignments. Our approach outperforms state-of-the-art structure-based methods on both annotated PDB structures and AlphaFold models. Nonetheless, in our analysis of performance on AlphaFold models, we observe a significant proportion (11%) of cases where UniDoc domain parsings are preferable to Chainsaw and a majority of cases (65%) where the two predictions are of equal quality. Among the cases where the parsings are of equal quality, a significant proportion are proteins that could be parsed in multiple ways, with alternative segmentations by each method appearing equally valid. For these reasons, we conclude that a sensible strategy for domain identification involves using an ensemble of domain prediction methods. This approach enables confidence measures derived from model consensus and a diversity of possible predictions. The possibility of multiple valid domain segmentations within a single structure motivates future work to extract multiple predictions from Chainsaw. We observe that the Chainsaw neural network confidence score is correlated with prediction accuracy (see Section 2.5) and we see some hints that uncertainty in the output is indicative of alternative valid assignments. This insight opens up the potential for adapting the domain assignment algorithm to yield multiple assignments from a single network prediction.
An advantage of Chainsaw when compared with models such as DPAM (Zhang et al. 2023) is the lack of dependencies on databases of known domains. We demonstrate a proof-of-concept showing that domain prediction methods can be combined with structure and sequence matching algorithms to systematically identify domains and homology relationships in large databases of predicted structures such as AFDB. We further show that this approach can detect homologs where sequence-based methods cannot (Fig. 9). A natural extension of this work is to apply these techniques at scale and develop approaches for detecting novel domains. It is important to note that while we have trained Chainsaw solely using domain annotations from the CATH database, there are several other widely used domain classification resources available, including SCOPe (Chandonia et al. 2021), Pfam (Mistry et al. 2021), and ECOD (Cheng et al. 2014). The benchmarking we conducted primarily focuses on CATH; however, we are encouraged to see that Chainsaw has good performance on domain annotations from CASP, as well as on unannotated structures in the AlphaFold database.

Figure 3. Input features for the Chainsaw neural network.

Figure 4. Chainsaw generates a confidence score which is the residue-averaged likelihood of A′ under Â. We find that this confidence measure has a good correlation with ground-truth accuracy. (a) High confidence model prediction where Â is very close to A′. (b) Low confidence prediction implies alternative assignments.

Figure 5. Domain assignments represented with binary pairwise co-membership matrices. (a) PDB 1b23P with continuous domains and corresponding pairwise domain label representation. (b) PDB 1a8eA with a discontinuous first domain.

Figure 6. Assessing performance on CATH-annotated PDB structures and AlphaFold models. Bar plots show 95% CI. (a) Benchmarking on the Chainsaw CATH test set. *EguchiCNN may have been trained on similar or identical proteins in this test set. (b) Benchmarking on the subset of the Chainsaw CATH test set that is non-redundant with the Merizo training data. (c) Performance comparison on the Merizo test data. For these results, we trained a Chainsaw model from scratch using the Merizo training data and subsequently evaluated on the same test data as Merizo. One chain was removed from the Merizo test data due to ambiguous labelling in CATH. *EguchiCNN may have been trained on similar or identical proteins in this test set. (d) Results from visually assessing Chainsaw and UniDoc domain parsings on 200 randomly selected AlphaFold models from the human proteome.

Figure 8. Example PDB structures where Chainsaw combined with Foldseek can identify two additional domains which have not yet been annotated by CATH. In each case, the single CATH-identified domain is shown on the left, Chainsaw's domain prediction is shown in the middle, and the matched CATH domains aligned with the Chainsaw predictions are shown on the right. (a) PDB structure 5ta1A. (b) PDB structure 5g48A. (c) PDB structure 6o0aA.

Figure 9. Uncharacterized proteins (no Pfam or GO annotations) from the L. infantum proteome were parsed with Chainsaw. The predicted domains were subsequently searched against the CATH S60 domain representative structures using Foldseek. We show four examples (a, b, c, d) where we can infer novel functional annotations via structural homology with representative CATH domains. Sequence identity for the matches above ranges from 11% to 16%, which indicates why these homologous relationships were not detected with sequence-only methods. Panel (d) shows that Chainsaw correctly parses the structure into four phytochelatin synthase domain repeats. This protein is common to multiple pathogenic organisms and has been considered a potential drug target because it has no human homolog (Ray and Williams 2011).